Abstract

We model the generation of words with independent, unequal probabilities of
occurrence of letters. We prove that the probability $p(r)$ of occurrence of
words of rank $r$ has power-law asymptotics. In contrast to the earlier paper
by B. Conrad and M. Mitzenmacher, we give a brief proof by elementary methods
and obtain an explicit formula for the exponent of the power law.
It is well known that in English texts the word "the" occurs most often and the
word "of" comes next. Each word form $w$ in a text can therefore be assigned
its rank $r(w)$, i.e., its position in the frequency list. The frequency $f(w)$
of the word $w$ in a text is defined as the ratio of the number of occurrences
of $w$ to the length of the text. According to the Zipf law (established in the
first half of the last century), the product of $r(w)$ and $f(w)$ is
approximately constant; for an English text it equals about 0.1. At present,
thanks to Google Labs, recognition results are available for 4% of all books
ever published [1]. According to our numerical tests, the OLS line fitted to
the logarithmically transformed values (for convenience, we use the decimal
logarithm) of $r$ and $f$ for the one hundred English words used most often in
2000 takes the form $\lg f = -1.05182 - 1.00026 \lg r$. The fact that modern
data (for the indicated sample) agree with the Zipf law is the starting point
of our work.
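
The fitting procedure described above can be sketched as follows. This is only
an illustration under stated assumptions: the data are synthetic (an idealized
$f = 0.1/r$ for ranks 1 to 100), not the Google Labs sample, and the helper name
`fit_loglog` is hypothetical.

```python
import math

def fit_loglog(ranks, freqs):
    """Ordinary least squares fit of lg f = a + b * lg r; returns (a, b)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope of the OLS line
    a = my - b * mx        # intercept of the OLS line
    return a, b

# Synthetic, idealized Zipf data (an assumption, not measured frequencies).
ranks = list(range(1, 101))
freqs = [0.1 / r for r in ranks]
a, b = fit_loglog(ranks, freqs)
print(f"lg f = {a:.5f} + ({b:.5f}) lg r")
```

For exactly Zipfian input the slope is -1 and the intercept is $\lg 0.1 = -1$;
real frequency lists give values that only approximate these.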

The Zipf law has been interpreted in many ways; see [2] for a brief review of
relevant papers published in Russia and for the derivation of this law from the
general properties of semiotic systems proposed by V.P. Maslov. In this paper
we consider a more traditional explanation of the Zipf law, namely, the
"monkey" model. We assume that a "monkey" independently types any of the 26
English letters with equal probabilities, or a space with probability $p_0$
[3], [4], [5]. We understand a word as a sequence of letters between two
spaces. Evidently, all words of the same length occur with equal probabilities,
and their ranks are consecutive. We denote by $p(r)$ the probability of
obtaining a word of rank $r$. Evidently, $\exists c_1, c_2$:
$c_1 < \ln p(r) + \alpha \ln r < c_2$, where in the case $p_0 = 1/27$ we have
$\alpha = \ln 27/\ln 26$ (a word of length $k$ has probability of order
$27^{-k}$ and rank of order $26^k$). In reality, different letters in a text
have different frequencies of occurrence; in Russian they are approximately
determined by the law established by S. Gusein-Zade [5], [7] (the probabilities
are proportional to the mean values of order statistics of the exponential
distribution).
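
The equal-probability case can be checked numerically. The sketch below assumes
$p_0 = 1/27$ and 26 equiprobable letters, counts the empty word as rank 1, and
evaluates $\ln p(r) + \alpha \ln r$ at the first and last rank of each word
length; the function names are hypothetical.

```python
import math

N_LETTERS = 26
P0 = 1.0 / 27.0                      # probability of a space
P_LETTER = (1.0 - P0) / N_LETTERS    # each letter equally likely (= 1/27)
ALPHA = math.log(27) / math.log(26)

def rank_range(k):
    """First and last rank of words of length k (the empty word has rank 1)."""
    first = (N_LETTERS ** k - 1) // (N_LETTERS - 1) + 1
    last = (N_LETTERS ** (k + 1) - 1) // (N_LETTERS - 1)
    return first, last

def deviations(max_len):
    """ln p(r) + ALPHA * ln r at the first and last rank of each word length."""
    out = []
    for k in range(max_len + 1):
        ln_p = math.log(P0) + k * math.log(P_LETTER)
        first, last = rank_range(k)
        out.append((k,
                    ln_p + ALPHA * math.log(first),
                    ln_p + ALPHA * math.log(last)))
    return out

for k, lo, hi in deviations(8):
    print(k, round(lo, 3), round(hi, 3))
```

Within one word length the probability is constant while the rank grows by a
factor of about 26, so the printed pairs bracket all values of the expression
for that length.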

Recently B. Conrad and M. Mitzenmacher [5] generalized the "monkey" model to
the case when letters have different probabilities of occurrence. They proved
an inequality analogous to (1) below by using Tauberian theorems for generating
functions. In the conclusion of that paper the authors write "It would, of
course, be pleasant to have a proof of the power law behavior in the case of
unequal probabilities that avoids some of this technical machinery" and discuss
possible ways to obtain such a result. In this paper we propose a short proof
obtained in another way, namely, by making use of properties of the Pascal
pyramid. Note that in the case of a Markovian dependence of probabilities of
letters the power law does not necessarily hold. We study the power law (as
well as the exponential one) for Markov chains (and obtain explicit formulas
for the corresponding parameters) in a separate paper.

Theorem 1 Let the probabilities of letters equal $p_1, \ldots, p_n$, $n > 1$
($\sum_{i=1}^{n} p_i = 1 - p_0$), and let $\gamma$ be the root of the equation
$\sum_{i=1}^{n} p_i^{\gamma} = 1$. Then

$\exists c_1, c_2: \quad c_1 < \ln p(r) + \ln r/\gamma < c_2.$ (1)
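
The exponent in the theorem is easy to evaluate numerically. A minimal sketch,
assuming plain bisection on $[0, 1]$ (the left-hand side of the equation is
strictly decreasing in $\gamma$, exceeds 1 at $\gamma = 0$ and is below 1 at
$\gamma = 1$); the name `solve_gamma` and the unequal probabilities below are
hypothetical illustrations, not values from the paper.

```python
import math

def solve_gamma(probs, tol=1e-12):
    """Root gamma in (0, 1) of sum(p_i ** gamma) = 1, found by bisection.

    Assumes at least two letters, every p_i in (0, 1), and sum(p_i) < 1.
    """
    if len(probs) < 2 or not all(0.0 < p < 1.0 for p in probs) or sum(probs) >= 1.0:
        raise ValueError("need n > 1 letters with 0 < p_i < 1 and sum(p_i) < 1")
    lo, hi = 0.0, 1.0      # sum p_i^0 = n > 1 and sum p_i^1 < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(p ** mid for p in probs) > 1.0:
            lo = mid       # sum still too large: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Equal-probability check: 26 letters with p_i = 1/27 give gamma = ln 26 / ln 27.
gamma_equal = solve_gamma([1.0 / 27.0] * 26)
# Hypothetical unequal probabilities (illustrative values only), p_0 = 0.2.
unequal = [0.5, 0.2, 0.1]
gamma_unequal = solve_gamma(unequal)
print(gamma_equal, 1.0 / gamma_equal)
print(gamma_unequal, 1.0 / gamma_unequal)
```

The second printed number in each line is $1/\gamma$, the exponent of the power
law for $p(r)$.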

Proof. Without loss of generality, for convenience, we introduce an empty
word; we assume that its rank equals one, while other words have greater ranks.
All words that contain $k_1$ letters of the 1st kind, ..., $k_n$ letters of the
$n$th kind have the same probability
$\Pr(k) = p_1^{k_1} \cdots p_n^{k_n} p_0$, and their ranks are consecutive. The
number of such words is given by the multinomial coefficient
$M(k_1, \ldots, k_n) = \frac{(k_1 + \cdots + k_n)!}{k_1! \cdots k_n!}$. Let us
fix some probability $f$ ($f \in (0, 1]$) and denote by $Q(f)$ the rank of the
last word $w$ whose probability is not less than $f$ in the list of all words
sorted in nonincreasing order of probabilities. Thus, for example,
$Q(p_0) = 1$ (we take into account only the empty word), $Q(p' p_0) = 2$, where
$p' = \max\{p_1, \ldots, p_n\}$ (here we assume that the maximum is unique),
etc. Evidently, the function $Q$ is nonincreasing, piecewise constant (it takes
on only positive integer values), and tends to infinity as $f \to 0$.

The rank of this word $w$ (i.e., the number of words at the beginning of the
mentioned list up to $w$ inclusive) equals $\sum M(k_1, \ldots, k_n)$, where
the sum is taken over all $n$-tuples $k = (k_1, \ldots, k_n)$ of nonnegative
integers such that $\Pr(k) \ge f$.

Let $Q(x) = Q(p_0 e^{-x})$, $x \ge 0$ (from here on, $Q$ denotes this function
of $x$). Evidently, $Q(x)$ is

$\sum_{k \ge 0:\ L_1 k_1 + \cdots + L_n k_n \le x} M(k_1, \ldots, k_n),$ (2)

where $L_i = -\ln p_i$, $i = 1, \ldots, n$. We have to prove that the value
$\ln Q(x) - \gamma x$ is bounded. This is a property of a multidimensional
generalization of the Pascal pyramid. Note that we have
$\sum_{i=1}^{n} e^{-L_i} = 1 - p_0$. Introduce $L'_i = \gamma L_i$ and put
$x' = \gamma x$. We have $\sum_{i=1}^{n} e^{-L'_i} = 1$, and relation (2) takes
the form

$\sum_{k \ge 0:\ L'_1 k_1 + \cdots + L'_n k_n \le x'} M(k_1, \ldots, k_n).$

We have to prove that by subtracting $x'$ from the logarithm of this sum we
obtain a bounded value. Thus, the general case is reduced to the case when
$\gamma = 1$.

Let us now prove the boundedness of the difference directly, assuming that

$\sum_{i=1}^{n} e^{-L_i} = 1.$ (3)

The function $Q(x)$ is defined for $x \ge 0$. For convenience, we extend it by
zero for $x < 0$. Analogously, we extend the Pascal pyramid by zero for
negative $k_i$. Then any number in the Pascal pyramid, except
$M(0, \ldots, 0)$, is the sum of the neighboring numbers in which one of the
indices is less by one. This leads to the functional equation
$Q(x) = Q(x - L_1) + \ldots + Q(x - L_n) + Q_0(x)$, where $Q_0$ is the
Heaviside step function (it equals zero for negative values of the argument
and one for nonnegative values).
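
The functional equation can be compared with the defining sum on a small
example. The sketch below assumes hypothetical integer values of $L_i$ (so that
all comparisons are exact) and integer $x$; it computes $Q(x)$ once by direct
summation of multinomial coefficients and once by the recursion.

```python
from functools import lru_cache
from itertools import product
from math import factorial

L = (1, 2, 3)  # hypothetical integer values standing in for L_i

def multinomial(ks):
    """Multinomial coefficient (k_1 + ... + k_n)! / (k_1! ... k_n!)."""
    result = factorial(sum(ks))
    for k in ks:
        result //= factorial(k)
    return result

def q_direct(x):
    """Q(x) as the sum over all k >= 0 with L_1 k_1 + ... + L_n k_n <= x."""
    if x < 0:
        return 0
    ranges = [range(x // li + 1) for li in L]
    return sum(multinomial(ks) for ks in product(*ranges)
               if sum(li * k for li, k in zip(L, ks)) <= x)

@lru_cache(maxsize=None)
def q_recursive(x):
    """Q(x) = Q(x - L_1) + ... + Q(x - L_n) + Q_0(x), with Q = 0 for x < 0."""
    if x < 0:
        return 0
    return sum(q_recursive(x - li) for li in L) + 1  # Q_0(x) = 1 for x >= 0

values = [q_recursive(x) for x in range(12)]
print(values)
```

The recursion needs only previously computed values of $Q$, which is what makes
the inductive argument of the proof possible.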

For $x > L' = \max\{L_1, \ldots, L_n\}$ we obtain the recurrence relation

$Q'(x) = Q'(x - L_1) + \ldots + Q'(x - L_n),$ (4)

where $Q'(x) = Q(x) + 1/(n - 1)$. If all fractions $L_i/L_j$,
$i, j = 1, \ldots, n$, are rational, then we obtain a recursive sequence, and
one can prove the theorem with the help of known explicit formulas for such
sequences. In the general case, instead of such formulas we use simple
inequalities.

Multiplying relation (3) by $e^x$, we obtain that the exponential function
satisfies Eq. (4). Since the function $Q'(x)$ is positive and piecewise
constant, the product $q(x) = Q'(x) e^{-x}$ is piecewise continuous, and hence
there exist positive $c_1$ and $c_2$ such that $c_1 < q(x) < c_2$ for all
$0 \le x \le L'$. In the recurrence relation (4) we replace the addends in the
right-hand side with their lower (upper) bounds and thus widen the interval
where the inequality $c_1 < q(x) < c_2$ holds up to $x \le L' + L''$, where
$L'' = \min\{L_1, \ldots, L_n\}$. Indeed, for $L' < x \le L' + L''$ every
argument $x - L_i$ lies in $[0, L']$, so that
$Q'(x) > c_1 \sum_{i=1}^{n} e^{x - L_i} = c_1 e^x$ by (3), and similarly
$Q'(x) < c_2 e^x$. Repeating this procedure, in a finite number of steps we
prove the inequality for any arbitrarily large $x$. Taking logarithms, we
conclude that $\ln Q'(x) - x$ is bounded, and then so is the difference
$\ln Q(x) - x$, which was to be proved.
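
The boundedness just proved can be observed numerically. The following sketch
assumes hypothetical letter probabilities 0.25, 0.09, 0.04, chosen so that
their square roots sum to one and hence $\gamma = 1/2$ exactly; it evaluates
$\ln Q(x) - \gamma x$ on a grid of $x$ by direct summation of multinomial
coefficients.

```python
import math
from itertools import product

# Hypothetical probabilities with sqrt(0.25) + sqrt(0.09) + sqrt(0.04) = 1,
# so gamma = 1/2; the space then has probability p_0 = 1 - 0.38 = 0.62.
PROBS = (0.25, 0.09, 0.04)
GAMMA = 0.5
COSTS = [-math.log(p) for p in PROBS]   # L_i = -ln p_i

def multinomial(ks):
    """Multinomial coefficient (k_1 + ... + k_n)! / (k_1! ... k_n!)."""
    result = math.factorial(sum(ks))
    for k in ks:
        result //= math.factorial(k)
    return result

def count_words(x):
    """Q(x): number of words (with the empty one) such that sum L_i k_i <= x."""
    ranges = [range(int(x / c) + 1) for c in COSTS]
    return sum(multinomial(ks) for ks in product(*ranges)
               if sum(c * k for c, k in zip(COSTS, ks)) <= x)

grid = [0.5 * j for j in range(61)]     # x = 0, 0.5, ..., 30
deviations = [math.log(count_words(x)) - GAMMA * x for x in grid]
print(min(deviations), max(deviations))
```

While $Q(x)$ itself grows by several orders of magnitude over this grid, the
printed minimum and maximum of the difference stay within a fixed narrow band.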

Thus, we have proved the power order of the asymptotics and obtained an
explicit formula for it. In the case of the function $Q(x)$ the parameter of
the power law is defined by the solution $\gamma$ of the equation
$\sum_{i=1}^{n} p_i^{\gamma} = 1$. But if we consider the probability $p(r)$ of
occurrence of the $r$th word in the frequency list, then we obtain the power
order of the asymptotics with the exponent $1/\gamma$ (i.e., greater than one).
Note that the verification of the power law with respect to the frequency of
occurrence of the one thousand (rather than one hundred) most frequently used
words in several European languages included in the database of Google Labs
(French, German, Russian, Spanish, and several variants of English) shows that
the exponent for each language differs essentially from one and is close to the
value calculated by our formula.

We are grateful to V.D. Solovyev for useful discussions. 